Health check feature for virtual router#3575

Merged
DaanHoogland merged 37 commits into apache:master from shapeblue:vr_health
Jan 30, 2020

@anuragaw
Contributor

@anuragaw anuragaw commented Aug 28, 2019

We want to support more exhaustive health checks for VRs. This feature lets admins configure health checks and also expands their scope. There are two categories of health checks: basic and advanced (advanced checks are more expensive, so they should be run less frequently). The following checks have been added with separate scripts:

  1. Services check (as per existing monitorServices.py) - basic check
  2. Disk space check against a threshold - basic check
  3. CPU usage check against a threshold - basic check
  4. Memory usage check against a threshold - basic check
  5. Router template and scripts version check - basic check
  6. Connectivity to the gateways from router - basic check
  7. DNS config match against MS - advanced check
  8. DHCP config match against MS - advanced check
  9. HAProxy config match against MS (internal LB and public LB) - advanced check
  10. Port forwarding match against MS in iptables - advanced check
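
To illustrate the threshold-style basic checks in the list above, here is a minimal Python sketch of a free disk space check. The function name, the default path and the returned dictionary shape are assumptions made for this example and are not the script added by this PR; only the idea of comparing free space in MB against a threshold comes from the description.

```python
import shutil


def check_free_disk_space(threshold_mb, path="/"):
    """Compare free space on `path` with a threshold in MB (hypothetical helper)."""
    # shutil.disk_usage returns total, used and free space in bytes.
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    # Free space below the threshold counts as a failure.
    success = free_mb >= threshold_mb
    if success:
        details = "Sufficient free space is %d MB" % free_mb
    else:
        details = "Free space is %d MB, below the threshold of %d MB" % (
            free_mb, threshold_mb)
    return {"success": success, "details": details}
```

Calling `check_free_disk_space(100)` would apply the 100 MB default described for "router.health.checks.free.disk.space.threshold".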

Following global configs were added for configuring health checks:
• "router.health.checks.enabled" - If true, router health checks are allowed to be executed and read. If false, all scheduled checks and API calls for on demand checks are disabled. Default is true.
• "router.health.checks.basic.interval" - Interval in minutes at which basic router health checks are performed. If set to 0, no tests are scheduled. Default is 3 minutes, matching the existing monitor services.
• "router.health.checks.advanced.interval" - Interval in minutes at which advanced router health checks are performed. If set to 0, no tests are scheduled. Default is 10 minutes.
• "router.health.checks.config.refresh.interval" - Interval in minutes at which the router health checks config (scheduling intervals, excluded checks, etc.) is updated on virtual routers by the management server. This value should be sufficiently higher (for example 2x) than router.health.checks.basic.interval and router.health.checks.advanced.interval so that there is time for new results to be generated from the passed data. Default is 10 minutes.
• "router.health.checks.results.fetch.interval" - Interval in minutes at which router health check results are fetched by the management server. On each fetch, the management server evaluates whether the VR needs to be recreated, as configured by router.health.checks.failures.to.recreate.vr. This value should be sufficiently higher (for example 2x) than router.health.checks.basic.interval and router.health.checks.advanced.interval so that there is time between new results generation and fetch.
• "router.health.checks.failures.to.recreate.vr" - Health check failures listed in this config are the ones that should cause router recreation. If empty, recreation is not attempted for any health check failure. Possible values are comma separated script names from systemvm’s /root/health_scripts/ (namely cpu_usage_check.py, dhcp_check.py, disk_space_check.py, dns_check.py, gateways_check.py, haproxy_check.py, iptables_check.py, memory_usage_check.py, router_version_check.py), connectivity.test, or services (namely loadbalancing.service, webserver.service, dhcp.service).
• "router.health.checks.to.exclude" - Health checks that should be excluded when executing scheduled checks on the router. This can be a comma separated list of script names placed in the '/root/health_checks/' folder. Currently the following scripts are placed in default systemvm template - cpu_usage_check.py, disk_space_check.py, gateways_check.py, iptables_check.py, router_version_check.py, dhcp_check.py, dns_check.py, haproxy_check.py, memory_usage_check.py.
• "router.health.checks.free.disk.space.threshold" - Free disk space threshold (in MB) on VR below which the check is considered a failure. Default is 100MB.
• "router.health.checks.max.cpu.usage.threshold" - Max CPU Usage threshold as % above which check is considered a failure.
• "router.health.checks.max.memory.usage.threshold" - Max Memory Usage threshold as % above which check is considered a failure.
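
The interaction between fetched results and "router.health.checks.failures.to.recreate.vr" can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the function names are hypothetical, the result dictionaries reuse the "checkname" and "success" keys from the API output in this PR, and the real evaluation happens inside the management server.

```python
def parse_csv_config(value):
    """Split a comma separated global config value into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def failures_triggering_recreate(results, failures_to_recreate):
    """Return the failed check names that are listed in the config value.

    An empty config value yields an empty list, i.e. no recreate is
    attempted for any failure.
    """
    watched = parse_csv_config(failures_to_recreate)
    return sorted(
        r["checkname"]
        for r in results
        if not r["success"] and r["checkname"] in watched
    )
```

With this sketch, a non-empty return value would be the signal to consider recreating the router; the same parsing helper could also serve "router.health.checks.to.exclude".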

API Changes:

  • listRouters and listInternalLoadBalancers now optionally take a flag includehealthcheckresults (default false) to fetch the last health check results for the router.
  • getRouterHealthCheckResults - a new API added to fetch health check results, with an optional flag performfreshchecks to execute checks on demand. On-demand execution is disabled only if "router.health.checks.enabled" is false. If performfreshchecks is true, all data from the management server is sent to the router and fresh checks are executed. If false, the previously executed results are retrieved from the router itself.

Additionally, the feature looks for any executable script in the /root/health_scripts/ directory and adds its result to the JSON output of the overall health checks. This allows custom checks to be added, and custom systemvm templates can also support health checks.
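
A rough Python sketch of that discovery step is shown below. It is not the code from this PR: the function names are hypothetical, and it assumes that an exit code of 0 means success and that the script's stdout carries the details message.

```python
import os
import subprocess


def find_executable_checks(directory):
    """Return sorted names of regular files in `directory` with execute permission."""
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            names.append(name)
    return names


def run_checks(directory, timeout=30):
    """Run each executable check and collect results keyed by script name.

    Assumption for this sketch: exit code 0 means success, and stdout
    holds the human-readable details.
    """
    results = {}
    for name in find_executable_checks(directory):
        proc = subprocess.run(
            [os.path.join(directory, name)],  # argument list, no shell
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
        results[name] = {
            "success": proc.returncode == 0,
            "details": proc.stdout.strip(),
        }
    return results
```

Passing the returned dictionary to `json.dumps` would give a JSON document similar in spirit to the results shown later in this description.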

The UI shows the router in an alert state if health checks fail.

The health checks can be triggered manually using the new API added by this feature (both the CLI and the UI support this).

Description

Fixes: 3270

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Integration tests, manually, CMK, UI


API changes:
New parameters added to list routers:

(local) 🐵 > list routers includehealthcheckresults=true filter=id,healthchecksfailed,healthcheckresults
{
  "count": 1,
  "router": [
    {
      "healthcheckresults": [
        {
          "checkname": "connectivity",
          "checktype": "basic",
          "details": "Successfully fetched data",
          "lastupdated": "2019-12-16T15:14:06+0530",
          "success": true
        },
        {
          "checkname": "cpu_usage_check.py",
          "checktype": "basic",
          "details": "CPU Usage within limits with current at 1.7%",
          "lastupdated": "2019-12-16T15:12:38+0530",
          "success": true
        },
        {
          "checkname": "dhcp.service",
          "checktype": "basic",
          "details": "service is running",
          "lastupdated": "2019-12-16T15:12:38+0530",
          "success": true
        },
        {
          "checkname": "dhcp_check.py",
          "checktype": "advance",
          "details": "All 1 VMs are present in dhcphosts.txt",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "disk_space_check.py",
          "checktype": "basic",
          "details": "Sufficient free space is 345 MB",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "dns_check.py",
          "checktype": "advance",
          "details": "All 1 VMs are present in /etc/hosts",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "gateways_check.py",
          "checktype": "basic",
          "details": "All 1 gateways are reachable via ping",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "haproxy_check.py",
          "checktype": "advance",
          "details": "No data provided to check, skipping",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "iptables_check.py",
          "checktype": "advance",
          "details": "No portforwarding rules provided to check, skipping",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "loadbalancing.service",
          "checktype": "basic",
          "details": "service is running",
          "lastupdated": "2019-12-16T15:12:38+0530",
          "success": true
        },
        {
          "checkname": "memory_usage_check.py",
          "checktype": "basic",
          "details": "Memory Usage within limits with current at 23.704%",
          "lastupdated": "2019-12-16T15:12:38+0530",
          "success": true
        },
        {
          "checkname": "router_version_check.py",
          "checktype": "basic",
          "details": "Template and scripts version match successful",
          "lastupdated": "2019-12-16T15:12:41+0530",
          "success": true
        },
        {
          "checkname": "ssh.service",
          "checktype": "basic",
          "details": "service is running",
          "lastupdated": "2019-12-16T15:12:38+0530",
          "success": true
        },
        {
          "checkname": "webserver.service",
          "checktype": "basic",
          "details": "service is running",
          "lastupdated": "2019-12-16T15:12:38+0530",
          "success": true
        }
      ],
      "healthchecksfailed": false,
      "id": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
    }
  ]
}
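
For anyone consuming this output from a script, here is a small Python sketch that pulls the failed checks out of a list routers response. The function name is hypothetical, and the embedded sample is a shortened copy of the response above.

```python
import json


def failed_checks(list_routers_response):
    """Map router id -> list of (checkname, details) for checks that did not succeed."""
    failures = {}
    for router in list_routers_response.get("router", []):
        failed = [
            (check["checkname"], check["details"])
            for check in router.get("healthcheckresults", [])
            if not check["success"]
        ]
        if failed:
            failures[router["id"]] = failed
    return failures


# Shortened copy of the response shown above.
sample = json.loads("""
{
  "count": 1,
  "router": [
    {
      "healthcheckresults": [
        {"checkname": "dhcp.service", "checktype": "basic",
         "details": "service is running", "success": true}
      ],
      "healthchecksfailed": false,
      "id": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
    }
  ]
}
""")
print(failed_checks(sample))  # prints {} because no check failed
```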

And the new API added, getRouterHealthCheckResults:

(local) 🐵 > get routerhealthcheckresults routerid="920452d6-7951-4425-ba2c-aecb2ddaaf6b" performfreshchecks=true
{
  "routerhealthchecks": {
    "healthchecks": [
      {
        "checkname": "connectivity.test",
        "checktype": "basic",
        "details": "Successfully fetched data",
        "lastupdated": "2019-12-16T15:19:47+0530",
        "success": true
      },
      {
        "checkname": "cpu_usage_check.py",
        "checktype": "basic",
        "details": "CPU Usage within limits with current at 2.4%",
        "lastupdated": "2019-12-16T15:19:43+0530",
        "success": true
      },
      {
        "checkname": "dhcp.service",
        "checktype": "basic",
        "details": "service is running",
        "lastupdated": "2019-12-16T15:19:43+0530",
        "success": true
      },
      {
        "checkname": "dhcp_check.py",
        "checktype": "advanced",
        "details": "All 1 VMs are present in dhcphosts.txt",
        "lastupdated": "2019-12-16T15:19:47+0530",
        "success": true
      },
      {
        "checkname": "disk_space_check.py",
        "checktype": "basic",
        "details": "Sufficient free space is 345 MB",
        "lastupdated": "2019-12-16T15:19:46+0530",
        "success": true
      },
      {
        "checkname": "dns_check.py",
        "checktype": "advanced",
        "details": "All 1 VMs are present in /etc/hosts",
        "lastupdated": "2019-12-16T15:19:47+0530",
        "success": true
      },
      {
        "checkname": "gateways_check.py",
        "checktype": "basic",
        "details": "All 1 gateways are reachable via ping",
        "lastupdated": "2019-12-16T15:19:46+0530",
        "success": true
      },
      {
        "checkname": "haproxy_check.py",
        "checktype": "advanced",
        "details": "No data provided to check, skipping",
        "lastupdated": "2019-12-16T15:19:47+0530",
        "success": true
      },
      {
        "checkname": "iptables_check.py",
        "checktype": "advanced",
        "details": "No portforwarding rules provided to check, skipping",
        "lastupdated": "2019-12-16T15:19:47+0530",
        "success": true
      },
      {
        "checkname": "loadbalancing.service",
        "checktype": "basic",
        "details": "service is running",
        "lastupdated": "2019-12-16T15:19:43+0530",
        "success": true
      },
      {
        "checkname": "memory_usage_check.py",
        "checktype": "basic",
        "details": "Memory Usage within limits with current at 23.8486%",
        "lastupdated": "2019-12-16T15:19:43+0530",
        "success": true
      },
      {
        "checkname": "router_version_check.py",
        "checktype": "basic",
        "details": "Template and scripts version match successful",
        "lastupdated": "2019-12-16T15:19:46+0530",
        "success": true
      },
      {
        "checkname": "ssh.service",
        "checktype": "basic",
        "details": "service is running",
        "lastupdated": "2019-12-16T15:19:43+0530",
        "success": true
      },
      {
        "checkname": "webserver.service",
        "checktype": "basic",
        "details": "service is running",
        "lastupdated": "2019-12-16T15:19:43+0530",
        "success": true
      }
    ],
    "routerid": "920452d6-7951-4425-ba2c-aecb2ddaaf6b"
  }
}
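
The lastupdated field makes it possible to spot results that have not been refreshed recently. The following Python sketch shows one way to do that; the function names and the maximum-age parameter are assumptions for illustration and are not part of this PR.

```python
from datetime import datetime, timedelta, timezone


def parse_lastupdated(value):
    """Parse timestamps such as 2019-12-16T15:19:47+0530 into aware datetimes."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


def stale_checks(healthchecks, now, max_age_minutes):
    """Return names of checks whose lastupdated is older than max_age_minutes."""
    limit = timedelta(minutes=max_age_minutes)
    return [
        check["checkname"]
        for check in healthchecks
        if now - parse_lastupdated(check["lastupdated"]) > limit
    ]
```

Here `healthchecks` is the list under "healthchecks" in the response above, and `now` should be a timezone-aware datetime, for example `datetime.now(timezone.utc)`.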

@anuragaw
Contributor Author

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✖centos7 ✔debian. JID-274

@svenvogel
Contributor

svenvogel commented Sep 11, 2019

@anuragaw I definitely like this feature. How does it work?

@yadvr yadvr added this to the 4.14.0.0 milestone Sep 19, 2019
@anuragaw
Contributor Author

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-286

@anuragaw
Contributor Author

@blueorangutan test

@blueorangutan

@anuragaw a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-381)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33789 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3575-t381-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@anuragaw anuragaw force-pushed the vr_health branch 2 times, most recently from b081006 to 9acaadd Compare November 5, 2019 10:43
@DaanHoogland DaanHoogland reopened this Nov 7, 2019
@anuragaw
Contributor Author

Rebased and cleaned up the UI, refactored into separate scripts, and tested thoroughly to fix some cases in the internal LB VM related scripts.

@anuragaw
Contributor Author

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-347

@anuragaw
Contributor Author

@blueorangutan test

@anuragaw
Contributor Author

anuragaw commented Nov 12, 2019

@rhtyd, @nvazquez, @DaanHoogland, @Spaceman1984, @shwstppr - ping for review.

@anuragaw
Contributor Author

@blueorangutan package

@anuragaw
Contributor Author

@blueorangutan test

@anuragaw
Contributor Author

@blueorangutan package

@blueorangutan

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-352

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from andrijapanicsb Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from andrijapanicsb Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from anuragaw Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from anuragaw Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@apache apache deleted a comment from blueorangutan Jan 29, 2020
@onitake
Contributor

onitake commented Jan 29, 2020

Cc @Doni7722 .

This looks pretty cool!

The included checks may actually be enough for our use case - but inserting custom checks seems a bit inconvenient. If I understand correctly, we'd have to build our own VR image?

Also, I don't think putting the checker script and temporary configs into /root is a good idea - scripts should live in a more standardised location like /usr/lib/cloudstack, while dynamic configs should go to /var/lib/cloudstack or similar.

@andrijapanicsb
Contributor

@onitake you can just build a custom systemvm.iso, i.e. extract the iso, extract the tgz, add scripts to the "root/health-checks/" folder, package it back to tgz, and repack the iso. Or simply have automation that connects via ssh to all VRs and drops the file in the folder if it's not already there.

@onitake
Contributor

onitake commented Jan 29, 2020

@andrijapanicsb Well, building a custom ISO is not exactly convenient either. But in most cases, the added preparation step and a build job on a CI system would be worth it. If you have to change checks often, this will be very inconvenient, however.

Pushing changes to VRs directly is something we want to avoid, as this can lead to strange problems if something is missed. We did something like that in the past. But, paired with updating the image, it might be a feasible option.

On the other hand, adding an API to inject custom scripts into VRs is (obviously) a big security risk.
In any case, any input to health check execution must be properly sanitized to avoid injection attacks.
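
To make the sanitization point concrete, here is a minimal Python sketch of the kind of validation I have in mind. It is not taken from this PR: the allowlist pattern, the function names, and the idea of passing the check type as a command line argument are all assumptions for illustration.

```python
import re

# Hypothetical allowlist: plain file names made of letters, digits, '_', '-'
# and '.', not starting with a dot.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-][A-Za-z0-9_.-]*")


def is_safe_check_name(name):
    """Accept only simple file names; reject paths and shell metacharacters."""
    return bool(_SAFE_NAME.fullmatch(name)) and ".." not in name


def build_command(directory, name, check_type):
    """Build an argument list for one check, so that no shell is involved."""
    if not is_safe_check_name(name):
        raise ValueError("unsafe check name: %r" % name)
    if check_type not in ("basic", "advanced"):
        raise ValueError("unknown check type: %r" % check_type)
    return [directory.rstrip("/") + "/" + name, check_type]
```

Handing the resulting list to `subprocess.run` without `shell=True` keeps user-supplied values from being interpreted by a shell.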

@blueorangutan

Trillian test result (tid-845)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30723 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3575-t845-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_password_server.py
Smoke tests completed. 78 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Contributor

looks ready for merge, rekicking travis


9 participants